
    INRIASAC: Simple Hypernym Extraction Methods

    Given a set of terms from a given domain, how can we structure them into a taxonomy without manual intervention? This is Task 17 of SemEval 2015. Here we present our simple taxonomy structuring techniques which, despite their simplicity, ranked first in this 2015 benchmark. We use large quantities of text (English Wikipedia) and simple heuristics, such as term overlap and document and sentence co-occurrence, to produce hypernym lists. We describe these techniques and present an initial evaluation of results. Comment: SemEval 2015, Jun 2015, Denver, United States.
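
    The heuristics mentioned above can be illustrated with a short sketch. The code below is not the INRIASAC system itself but a minimal illustration, assuming the domain terms and a list of sentences are already available: a term-overlap rule proposes a term as a hypernym of any longer term that ends with it, and sentence co-occurrence counts are collected as evidence for further candidates.

```python
# Minimal illustration of the heuristics above (not the INRIASAC code itself),
# assuming the domain terms and a list of sentences are already available.
from collections import defaultdict
from itertools import combinations

def head_overlap_hypernyms(terms):
    """Term overlap: if one known term is the head (final words) of a longer
    term, e.g. 'juice' inside 'apple juice', propose it as a hypernym."""
    term_set = set(terms)
    candidates = defaultdict(set)
    for term in term_set:
        words = term.split()
        for i in range(1, len(words)):
            suffix = " ".join(words[i:])
            if suffix in term_set:
                candidates[term].add(suffix)
    return candidates

def sentence_cooccurrence(sentences, terms):
    """Sentence co-occurrence: count how often pairs of domain terms appear
    in the same sentence (naive substring matching), as evidence that the
    more frequent term of a pair may be a hypernym of the rarer one."""
    term_counts = defaultdict(int)
    pair_counts = defaultdict(int)
    for sentence in sentences:
        present = sorted({t for t in terms if t in sentence})
        for t in present:
            term_counts[t] += 1
        for a, b in combinations(present, 2):
            pair_counts[(a, b)] += 1
    return term_counts, pair_counts

if __name__ == "__main__":
    print(head_overlap_hypernyms(["juice", "apple juice", "orange juice", "fruit"]))
```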

    Determining the Characteristic Vocabulary for a Specialized Dictionary using Word2vec and a Directed Crawler

    Specialized dictionaries are used to understand concepts in specific domains, especially where those concepts are not part of the general vocabulary or have meanings that differ from ordinary language. The first step in creating a specialized dictionary involves detecting the characteristic vocabulary of the domain in question. Classical methods for detecting this vocabulary involve gathering a domain corpus, calculating statistics on the terms found there, and then comparing these statistics to a background or general language corpus. Terms found significantly more often in the specialized corpus than in the background corpus are candidates for the characteristic vocabulary of the domain. Here we present two tools, a directed crawler and a distributional semantics package, that can be used together, circumventing the need for a background corpus. Both tools are available on the web.
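
    As a rough illustration of the background-corpus-free idea (not the authors' released tools), the sketch below assumes a tokenized domain corpus gathered by a directed crawler and a list of seed terms, and uses gensim's word2vec implementation to propose distributional neighbours as candidate characteristic vocabulary.

```python
# Rough illustration (not the authors' released tools): assume a directed
# crawler has already gathered a tokenized domain corpus, then use gensim's
# word2vec to propose distributional neighbours of the seed terms as
# candidates for the characteristic vocabulary, with no background corpus.
from gensim.models import Word2Vec

def expand_seed_terms(tokenized_sentences, seed_terms, topn=10):
    model = Word2Vec(tokenized_sentences, vector_size=100, window=5,
                     min_count=2, workers=4)
    candidates = {}
    for term in seed_terms:
        if term in model.wv:
            candidates[term] = [w for w, _ in model.wv.most_similar(term, topn=topn)]
    return candidates
```

    In practice the proposed neighbours would be filtered and the expansion iterated, but the core idea is to replace the comparison against a background corpus with distributional similarity computed inside the crawled domain corpus alone.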

    On the Place of Text Data in Lifelogs, and Text Analysis via Semantic Facets

    Current research in lifelog data has not paid enough attention to the analysis of cognitive activities in comparison to physical activities. We argue that, as we look into the future, wearable devices are going to be cheaper and more prevalent, and textual data will play a more significant role. Data captured by lifelogging devices will increasingly include speech and text, potentially useful in the analysis of intellectual activities. By analyzing what a person hears, reads, and sees, we should be able to measure the extent of cognitive activity devoted to a certain topic or subject by a learner. Text-based lifelog records can benefit from semantic analysis tools developed for natural language processing. We show how semantic analysis of such text data can be achieved through the use of taxonomic subject facets and how these facets might be useful in quantifying cognitive activity devoted to various topics in a person's day. We are currently developing a method to automatically create taxonomic topic vocabularies that can be applied to this detection of intellectual activity.
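
    A minimal sketch of how facet vocabularies could be used to quantify topical activity, assuming each facet is simply a list of terms and the lifelog text is already split into segments; the facet names and data structures below are illustrative, not the method under development.

```python
# Illustrative sketch only: facets are assumed to be plain term lists keyed
# by topic name, and lifelog text is assumed to be pre-split into segments
# (e.g. transcribed speech, text the wearer has read).
from collections import Counter

def facet_activity(text_segments, facet_terms):
    """Count, per facet, how many captured text segments mention at least
    one of the facet's terms, as a rough proxy for the cognitive activity
    devoted to that topic during the day."""
    activity = Counter()
    for segment in text_segments:
        lowered = segment.lower()
        for facet, terms in facet_terms.items():
            if any(term in lowered for term in terms):
                activity[facet] += 1
    return activity

if __name__ == "__main__":
    segments = ["reading about taxonomy induction", "ordered an apple juice"]
    facets = {"nlp": ["taxonomy", "hypernym"], "food": ["juice", "apple"]}
    print(facet_activity(segments, facets))  # Counter({'nlp': 1, 'food': 1})
```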

    Mining social media to create personalized recommendations for tourist visits

    Photo sharing platform users often annotate their trip photos with landmark names. These annotations can be aggregated in order to recommend lists of popular visitor attractions similar to those found in classical tourist guides. However, individual tourist preferences can vary significantly, so good recommendations should be tailored to individual tastes. Here we pose this visit personalization as a collaborative filtering problem. We mine the record of visited landmarks exposed in online user data to build a user-user similarity matrix. When a user wants to visit a new destination, a list of potentially interesting visitor attractions is produced based on the experience of like-minded users who have already visited that destination. We compare our recommender to a baseline which simulates classical tourist guides on a large sample of Flickr users.
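
    A minimal sketch of the user-user collaborative-filtering idea described above, assuming each user is represented simply by the set of landmarks they have visited; the Jaccard similarity and the ranking scheme are illustrative simplifications, not the authors' exact recommender.

```python
# Minimal sketch of the user-user collaborative-filtering idea above
# (not the authors' exact recommender): users are sets of visited landmarks,
# similarity is Jaccard overlap, and attractions in the destination city
# are ranked by similarity-weighted votes from like-minded users.
from collections import defaultdict

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(target_user, visits, destination_landmarks, topn=5):
    """visits: dict mapping user -> set of landmarks they have visited.
    destination_landmarks: set of landmarks in the city to be visited."""
    scores = defaultdict(float)
    for other, other_visits in visits.items():
        if other == target_user:
            continue
        similarity = jaccard(visits[target_user], other_visits)
        for landmark in other_visits & destination_landmarks:
            scores[landmark] += similarity  # more similar users vote more
    ranked = sorted(scores, key=scores.get, reverse=True)
    unseen = [lm for lm in ranked if lm not in visits[target_user]]
    return unseen[:topn]
```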

    Transforming Wikipedia into an Ontology-based Information Retrieval Search Engine for Local Experts using a Third-Party Taxonomy

    Wikipedia is widely used for finding general information about a wide variety of topics. Its vocation is not to provide local information. For example, it provides plot, cast, and production information about a given movie, but not showing times in your local movie theatre. Here we describe how we can connect local information to Wikipedia, without altering its content. The case study we present involves finding local scientific experts. Using a third-party taxonomy, independent of Wikipedia's category hierarchy, we index information connected to our local experts, present in their activity reports, and we re-index Wikipedia content using the same taxonomy. The connections between Wikipedia pages and local expert reports are stored in a relational database, accessible through a public SPARQL endpoint. A Wikipedia gadget (or plugin), activated by the interested user, accesses the endpoint as each Wikipedia page is accessed. An additional tab on the Wikipedia page allows the user to open up a list of teams of local experts associated with the subject matter of the Wikipedia page. The technique, though presented here as a way to identify local experts, is generic, in that any third-party taxonomy can be used in this way to connect Wikipedia to any non-Wikipedia data source. Comment: Joint Second Workshop on Language and Ontology & Terminology and Knowledge Structures (LangOnto2 + TermiKS), May 2016, Portoroz, Slovenia.
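
    A hypothetical sketch of the lookup behind such a gadget: the endpoint URL, RDF prefix, and property names below are invented for illustration, since the abstract only states that page-to-expert links are exposed through a public SPARQL endpoint.

```python
# Hypothetical sketch of the lookup behind the gadget: the endpoint URL,
# RDF prefix, and property names are invented for illustration; the paper
# only states that page-to-expert links sit behind a public SPARQL endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

def local_experts_for_page(page_title, endpoint="https://example.org/sparql"):
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(f"""
        PREFIX ex: <http://example.org/schema#>
        SELECT DISTINCT ?team WHERE {{
            ?link ex:wikipediaPage "{page_title}" ;
                  ex:expertTeam ?team .
        }}
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["team"]["value"] for b in results["results"]["bindings"]]
```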

    The WWW as a Resource for Lexicography

    Until the appearance of the Brown Corpus with its 1 million words in the 1960s and then, on a larger scale, the British National Corpus (the BNC) with its 100 million words, the lexicographer had to rely pretty much on his or her intuition (and amassed scraps of paper) to describe how words were used. Since the task of a lexicographer was to summarize the senses and usages of a word, that person was called upon to be very well read, with a good memory and a great sensitivity to nuance. These qualities are still and always will be needed when one must condense the description of a great variety of phenomena into a fixed amount of space. But what if this last constraint, a fixed amount of space, disappears? One can then imagine fuller descriptions of how words are used. Taking this imaginative step, the FrameNet project has begun collecting new, fuller descriptions into a new type of lexicographical resource in which '[e]ach entry will in principle provide an exhaustive account of the semantic and syntactic combinatorial properties of one "lexical unit" (i.e., one word in one of its uses)' (Fillmore & Atkins 1998). This ambition to provide an exhaustive accounting of these properties implies access to a large number of examples of words in use. Though the Brown Corpus and the British National Corpus can provide a certain number of these, the World Wide Web (WWW) presents a vastly larger collection of examples of language use. The WWW is a new resource for lexicographers in their task of describing word patterns and their meanings. In this chapter, we look at the WWW as a corpus, and see how this will change how lexicographers model word meaning.

    The Future of Linguistics and Lexicographers: Will there be Lexicographers in the year 3000 ?

    Lexicography has been a respected tradition of scientific pursuit since the eighteenth century. Will the science of lexicography survive into the third millennium? I will discuss what present-day computational linguistics can offer the lexicographer in the lexicographic task. I will show that the data can be purified into clearer and clearer forms through approximate linguistics. I will then speculate about what aspects of the lexicographic task can be automated, what tasks remain incumbent on humans, and wonder whether a new vision of the lexicon will emerge.

    Estimating the Number of Concepts

    Most Natural Language Processing systems have been built around the idea of a word being something found between white spaces and punctuation. This is a normal and efficient way to proceed. Tasks such as Word Sense Disambiguation, Machine Translation, or even indexing rarely go beyond the single word. Language models used in NLP applications are built on the word, with a few multiword expressions taken as exceptions. But future NLP systems will necessarily venture out into the uncharted areas of multiword expressions. The dimensions and the topology of multiword concepts are unknown: Are there hundreds of thousands or tens of millions? Which words participate in multiword concepts and which do not? As the corpus grows, will their number keep on increasing? In this paper, I estimate the number of multiword concepts that are used in English, systematically probing the Web as our corpus.
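
    The paper probes the Web itself; as a toy stand-in for that methodology, the sketch below watches how the number of recurring adjacent-word candidates grows as more text from a local corpus is read, which is one simple way to pose the "will their number keep on increasing?" question.

```python
# Toy stand-in, not the paper's web-probing methodology: read sentences from
# a local corpus, count adjacent-word bigrams, and record how many distinct
# bigrams recur at least `min_count` times as the corpus grows.
from collections import Counter

def multiword_growth(sentences, min_count=3, step=1000):
    bigram_counts = Counter()
    growth = []  # (sentences read, recurring bigram candidates) pairs
    for i, sentence in enumerate(sentences, 1):
        words = sentence.split()
        bigram_counts.update(zip(words, words[1:]))
        if i % step == 0:
            candidates = sum(1 for c in bigram_counts.values() if c >= min_count)
            growth.append((i, candidates))
    return growth
```

    Plotting such a curve, or fitting a growth function to it, is one simple way to ask whether the number of multiword candidates keeps growing with corpus size.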

    Taxonomy Induction using Hypernym Subsequences

    We propose a novel, semi-supervised approach towards domain taxonomy induction from an input vocabulary of seed terms. Unlike all previous approaches, which typically extract direct hypernym edges for terms, our approach utilizes a novel probabilistic framework to extract hypernym subsequences. Taxonomy induction from the extracted subsequences is cast as an instance of the minimum-cost flow problem on a carefully designed directed graph. Through experiments, we demonstrate that our approach outperforms state-of-the-art taxonomy induction approaches across four languages. Importantly, we also show that our approach is robust to the presence of noise in the input vocabulary. To the best of our knowledge, no previous approach has been empirically shown to exhibit such robustness to noise in the input vocabulary.
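
    A hedged illustration of casting taxonomy attachment as a minimum-cost flow problem with networkx; the graph construction below (unit demand per term, integer costs derived from toy hypernym-confidence scores, root edges only for terms with no candidate parent) is a simplification for illustration, not the paper's exact formulation.

```python
# Hedged illustration of the minimum-cost flow casting with networkx; this
# graph construction is a simplification, not the paper's exact formulation.
import networkx as nx

def induce_taxonomy(terms, scored_edges, root="ROOT"):
    """scored_edges: (hypernym, hyponym, confidence in [0, 1]) triples.
    Returns the edges carrying flow in a minimum-cost solution."""
    n = len(terms)
    G = nx.DiGraph()
    G.add_node(root, demand=-n)           # the root supplies one unit per term
    has_parent = {hypo for _, hypo, _ in scored_edges}
    for t in terms:
        G.add_node(t, demand=1)           # each term must be attached somewhere
        if t not in has_parent:
            G.add_edge(root, t, capacity=n, weight=0)
    for hyper, hypo, conf in scored_edges:
        # cheaper edges for more confident hypernym candidates
        G.add_edge(hyper, hypo, capacity=n, weight=round(100 * (1 - conf)))
    flow = nx.min_cost_flow(G)
    return [(u, v) for u, out in flow.items() for v, units in out.items() if units > 0]

if __name__ == "__main__":
    terms = ["fruit", "apple", "banana"]
    edges = [("fruit", "apple", 0.9), ("fruit", "banana", 0.8), ("apple", "banana", 0.3)]
    print(induce_taxonomy(terms, edges))
    # expected: [('ROOT', 'fruit'), ('fruit', 'apple'), ('fruit', 'banana')]
```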